Course: COMPUTATIONAL THINKING FOR GOVERNANCE ANALYTICS

Prof. José Manuel Magallanes, PhD

  • Visiting Professor of Computational Policy at Evans School of Public Policy and Governance, and eScience Institute Senior Data Science Fellow, University of Washington.
  • Professor of Government and Political Methodology, Pontificia Universidad Católica del Perú.

Dimensionality Reduction in R

As the name implies, we want to reduce a set of several variables into a few variables. In this session, we will practice two basic techniques:

Cluster Analysis

Simply speaking, this technique organizes the cases (rows) into a small set of groups, based on the information (the columns) available for each case. It creates a new variable of categorical type, generally ordinal.
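As a minimal sketch of that idea (using R's built-in mtcars data, not the course data), a clustering run returns one group label per row, that is, a new categorical variable:

```r
# toy data: a few numeric columns from mtcars, standardized
toy = scale(mtcars[, c("mpg", "hp", "wt")])
# hierarchical clustering, cut into 3 groups:
groups = as.factor(cutree(hclust(dist(toy)), k = 3))
table(groups)  # group sizes: the rows are now organized into 3 categories
```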

Let me bring back the data we prepared in Python:

# clean memory
rm(list = ls()) 

# link to data file
link='https://github.com/EvansDataScience/CTforGA_integrating/raw/main/demo_fragile.RDS'

# a RDS file from the web needs:
myFile=url(link)

# reading in the data:
fromPy=readRDS(file = myFile)

# reset indexes to R paradigm:
row.names(fromPy)=NULL

# check data types
str(fromPy)
## 'data.frame':    165 obs. of  21 variables:
##  $ Country                     : chr  "Norway" "New Zealand" "Finland" "Sweden" ...
##  $ Regimetype                  : Ord.factor w/ 4 levels "Authoritarian"<..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Overallscore                : num  9.75 9.37 9.27 9.26 9.18 9.09 9 8.9 8.9 8.88 ...
##  $ Electoralprocessandpluralism: num  10 10 10 9.58 10 10 10 10 9.58 9.58 ...
##  $ Functioningofgovernment     : num  9.64 8.93 9.29 9.29 8.21 8.93 7.86 8.57 8.93 8.93 ...
##  $ Politicalparticipation      : num  10 9.44 8.89 8.33 8.89 8.33 8.33 7.78 7.78 8.33 ...
##  $ Politicalculture            : num  10 8.75 8.75 10 9.38 9.38 9.38 8.75 9.38 8.75 ...
##  $ Civilliberties              : num  9.12 9.71 9.41 9.12 9.41 8.82 9.41 9.41 8.82 8.82 ...
##  $ Total                       : num  0.0419 0.2304 0 0.5445 0.1885 ...
##  $ C1SecurityApparatus         : num  1.8 1.4 2.5 2.7 0.7 1.7 2.7 2.7 1.6 2.2 ...
##  $ C2FactionalizedElites       : num  1.1 1.4 1.4 1.8 1.8 1.4 1.5 1.7 1 3.4 ...
##  $ C3GroupGrievance            : num  3.3 2.6 0.6 1.7 0.5 3.7 0.5 3.1 2.7 3.6 ...
##  $ E1Economy                   : num  1.9 3.4 2.9 1.9 3.4 1.7 2.7 1.6 2 2.2 ...
##  $ E2EconomicInequality        : num  1 2.1 1 1.7 1.3 1.2 1.6 1.8 1.8 1.6 ...
##  $ E3HumanFlightandBrainDrain  : num  0.8 1.6 1.5 0.7 1.9 1.3 2.8 0.5 1.1 2.4 ...
##  $ P1StateLegitimacy           : num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ P2PublicServices            : num  1.6 1.4 1.6 1.4 1.2 1.4 2.5 2.8 1.6 1.3 ...
##  $ P3HumanRights               : num  0.5 0.5 0.5 0.9 0.5 0.9 1.6 1.7 0.8 0.7 ...
##  $ S1DemographicPressures      : num  1.4 1.4 1.7 3.3 1.5 2.3 2.8 2.9 3.2 3.1 ...
##  $ S2RefugeesandIDPs           : num  2.2 1.6 1.5 4.3 1.5 2.2 1.4 2 3.1 2.6 ...
##  $ X1ExternalIntervention      : num  0.5 0.5 0.5 0.5 3.2 0.5 1.6 0.5 0.5 0.5 ...

I. Prepare data:

We have several countries, and several columns. In clustering, we try to create groups with the highest homogeneity within groups and the highest heterogeneity between groups. The variables serve to compute some distance among the cases, so that the clustering algorithms can detect the homogeneity and heterogeneity mentioned:
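As a toy sketch of what "distance among the cases" means (hypothetical numbers, not the course data):

```r
# two hypothetical countries, each described by three scores
a = c(9.6, 10.0, 9.1)
b = c(2.5, 1.4, 3.0)
# Euclidean distance: the straight-line distance between the two cases
sqrt(sum((a - b)^2))
# the same value via R's dist()
dist(rbind(a, b))
```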

  1. Subset the data: For this case just keep the columns with numeric values without the categories or the summary scores:
# select variables
dataToCluster=fromPy[,-c(1:3,9)]
#save the country names as the row index
row.names(dataToCluster)=fromPy$Country
  2. Decide if transformation is needed: Take a look at the data ranges:
summary(dataToCluster)
##  Electoralprocessandpluralism Functioningofgovernment Politicalparticipation
##  Min.   : 0.000               Min.   :0.000           Min.   : 0.000        
##  1st Qu.: 1.420               1st Qu.:2.500           1st Qu.: 3.890        
##  Median : 7.000               Median :5.000           Median : 5.560        
##  Mean   : 5.619               Mean   :4.616           Mean   : 5.377        
##  3rd Qu.: 9.170               3rd Qu.:6.430           3rd Qu.: 6.670        
##  Max.   :10.000               Max.   :9.640           Max.   :10.000        
##  Politicalculture Civilliberties  C1SecurityApparatus C2FactionalizedElites
##  Min.   : 1.250   Min.   :0.000   Min.   : 0.500      Min.   : 1.000       
##  1st Qu.: 3.750   1st Qu.:3.240   1st Qu.: 3.500      1st Qu.: 5.100       
##  Median : 5.000   Median :5.590   Median : 5.600      Median : 7.300       
##  Mean   : 5.351   Mean   :5.326   Mean   : 5.294      Mean   : 6.601       
##  3rd Qu.: 6.250   3rd Qu.:7.650   3rd Qu.: 6.900      3rd Qu.: 8.400       
##  Max.   :10.000   Max.   :9.710   Max.   :10.000      Max.   :10.000       
##  C3GroupGrievance   E1Economy     E2EconomicInequality
##  Min.   :0.500    Min.   :1.200   Min.   :1.000       
##  1st Qu.:4.000    1st Qu.:4.400   1st Qu.:3.600       
##  Median :5.800    Median :5.900   Median :5.200       
##  Mean   :5.758    Mean   :5.778   Mean   :5.255       
##  3rd Qu.:7.500    3rd Qu.:7.100   3rd Qu.:7.100       
##  Max.   :9.900    Max.   :9.800   Max.   :9.600       
##  E3HumanFlightandBrainDrain P1StateLegitimacy P2PublicServices P3HumanRights  
##  Min.   :0.500              Min.   : 0.500    Min.   : 1.200   Min.   :0.500  
##  1st Qu.:3.800              1st Qu.: 3.700    1st Qu.: 3.700   1st Qu.:3.400  
##  Median :5.500              Median : 6.400    Median : 5.400   Median :6.000  
##  Mean   :5.212              Mean   : 5.817    Mean   : 5.719   Mean   :5.447  
##  3rd Qu.:6.800              3rd Qu.: 8.200    3rd Qu.: 8.000   3rd Qu.:7.400  
##  Max.   :9.000              Max.   :10.000    Max.   :10.000   Max.   :9.800  
##  S1DemographicPressures S2RefugeesandIDPs X1ExternalIntervention
##  Min.   :1.400          Min.   : 0.700    Min.   : 0.500        
##  1st Qu.:4.100          1st Qu.: 2.800    1st Qu.: 3.400        
##  Median :5.700          Median : 4.300    Median : 5.400        
##  Mean   :5.901          Mean   : 4.802    Mean   : 5.124        
##  3rd Qu.:8.000          3rd Qu.: 6.600    3rd Qu.: 6.900        
##  Max.   :9.800          Max.   :10.000    Max.   :10.000
boxplot(dataToCluster,horizontal = T, las=2,cex.axis=0.4)

The data values have a similar range, so you do not need to transform the data. Possible alternatives could have been:

library(BBmisc)

# standardizing (mean 0, standard deviation 1)
normalize(dataToCluster,method='standardize')

# normalizing to a given range
normalize(dataToCluster,method='range',range=c(0,10))

II. Compute the DISTANCE MATRIX:

  1. Set random seed: Multivariate techniques may use random processes; setting a seed allows replication of results.
set.seed(999) 
  2. Decide distance method and compute distance matrix: We use gower because it allows for multiple data types. The default, the euclidean, works well when all the data are numeric.
library(cluster)
distanceMatrix=daisy(x=dataToCluster, metric = "gower")
  3. Represent distances: Once you have the distanceMatrix, you can use it to locate each case in a simpler space; in this case, let’s use a two-dimensional representation. The function cmdscale can produce a two-dimensional map of points using the distanceMatrix:
projectedData = cmdscale(distanceMatrix, k=2)

The object projectedData is saving coordinates for each element in the data:

# save coordinates to original data frame:
fromPy$dim1 = projectedData[,1]
fromPy$dim2 = projectedData[,2]

# see some:

fromPy[,c('dim1','dim2')][1:10,]
##          dim1         dim2
## 1  -0.4637066 -0.017476271
## 2  -0.4374335 -0.016755363
## 3  -0.4492530 -0.017650599
## 4  -0.4115496  0.002026328
## 5  -0.4318327 -0.012772446
## 6  -0.4280394 -0.017597666
## 7  -0.3980438 -0.013114093
## 8  -0.4004756 -0.016744500
## 9  -0.4125417 -0.010853684
## 10 -0.3835142 -0.011989627
  4. Use those points and see a “map” of the cases:
library(ggplot2)
base= ggplot(data=fromPy,
             aes(x=dim1, y=dim2,
                 label=Country)) 
base + geom_text(size=2)

Can you see some groups emerging?

An alternative is the dendrogram:

# prepare hierarchical cluster
hc = hclust(distanceMatrix)
# very simple dendrogram
plot(hc,hang = -1,cex=0.5)

Compute Clusters

There are several techniques for clustering, each with pros and cons. We will review hierarchical clustering here. As the name implies, it builds a hierarchy of all the possible clusters, so it is up to us to identify how many clusters emerge; the dendrogram helps with that. The hierarchical approach has two different strategies. The agglomerative approach starts by considering each case (row) a cluster, and then uses the distances to link the individual cases; not all cases are expected to be linked during the first step, the following steps create more clusters, and so on. The linking process can also vary (we will use the ward linkage). On the other hand, the divisive approach starts by considering all the cases as one cluster, and then divides them.
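As a sketch, both strategies can also be called directly from the cluster package, reusing the distanceMatrix computed above; agnes() is the agglomerative routine and diana() the divisive one:

```r
library(cluster)
# bottom-up (agglomerative), with ward linkage:
agn = agnes(distanceMatrix, diss = TRUE, method = "ward")
# top-down (divisive):
dia = diana(distanceMatrix, diss = TRUE)
# both return a full hierarchy; cutree() extracts a given number of groups:
head(cutree(as.hclust(agn), k = 4))
head(cutree(as.hclust(dia), k = 4))
```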

1. Compute clustering suggestions

Using the function fviz_nbclust from the library factoextra, we can see how many clusters are suggested.

  1. For hierarchical (agglomerative):
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(dataToCluster, 
             hcut,
             diss=distanceMatrix,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "agnes")

  2. For hierarchical (divisive):
fviz_nbclust(dataToCluster, 
             hcut,
             diss=distanceMatrix,
             method = "gap_stat",
             k.max = 10,
             verbose = F,
             hc_func = "diana")

We could accept the number of clusters suggested or not. Let’s use the suggestion:

2. Compute clusters:

Here you can compute the clusters. I am using both the agglomerative and the divisive approaches. Notice that you have to indicate the number of clusters a priori.

NumberOfClusterDesired=4


#library(factoextra)
res.agnes= hcut(distanceMatrix, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='agnes',
                hc_method = "ward.D2")

# Hierarchical technique- divisive approach
res.diana= hcut(distanceMatrix, 
                k = NumberOfClusterDesired,
                isdiss=TRUE,
                hc_func='diana',
                hc_method = "ward.D2")

3. Results from Clustering

3.1 Save results to original data frame:

fromPy$agn=as.factor(res.agnes$cluster)
fromPy$dia=as.factor(res.diana$cluster)

3.2 Verify ordinality in clusters: Each cluster has a label. Should the result represent some ordering, the labels do not necessarily reflect it. Below, I use the column Overallscore to get an idea of the actual ordering that the labels represent.

aggregate(data=fromPy,
          Overallscore~agn,
          FUN=mean)
##   agn Overallscore
## 1   1     8.082250
## 2   2     6.272222
## 3   3     3.503220
## 4   4     2.627143
aggregate(data=fromPy,
          Overallscore~dia,
          FUN=mean)
##   dia Overallscore
## 1   1     8.309118
## 2   2     6.553333
## 3   3     5.128333
## 4   4     2.843019

You could recode these values so that the labels represent an ascending order.

library(dplyr)

fromPy$agn=dplyr::recode_factor(fromPy$agn, 
                  `1` = '4',`2`='3',`3`='2',`4`='1')
fromPy$dia=dplyr::recode_factor(fromPy$dia, 
                  `1` = '4',`2`='3',`3`='2',`4`='1')

4. Evaluate Results

The hierarchical clustering process returns the silhouette information for each observation, a measure of how well a case has been classified. Silhouettes vary from -1 to +1: the higher the positive value, the better classified a case is. Low positive values inform of poor clustering, and negative values inform of wrongly clustered cases.
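As a sketch, the same silhouette information can be computed directly with cluster::silhouette(), combining the cluster labels and the distance matrix from the earlier steps:

```r
library(cluster)
sil = silhouette(res.agnes$cluster, distanceMatrix)
# per-case silhouette widths, in [-1, 1]:
head(sil[, "sil_width"])
# overall average width:
summary(sil)$avg.width
```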

4.1 Plot silhouettes

fviz_silhouette(res.agnes)
##   cluster size ave.sil.width
## 1       1   40          0.42
## 2       2   45          0.25
## 3       3   59          0.19
## 4       4   21          0.34

library(factoextra)
fviz_silhouette(res.diana)
##   cluster size ave.sil.width
## 1       1   34          0.36
## 2       2   24          0.14
## 3       3   54          0.13
## 4       4   53          0.27

4.2 Detecting cases wrongly clustered

  1. Save individual silhouettes:

The previous results have saved important information. Let me keep the negative silhouette values:

agnEval=data.frame(res.agnes$silinfo$widths)
diaEval=data.frame(res.diana$silinfo$widths)

agnPoor=rownames(agnEval[agnEval$sil_width<0,])
diaPoor=rownames(diaEval[diaEval$sil_width<0,])

Now, I can see what countries are not well clustered:

library("qpcR") 
bad_Clus=as.data.frame(qpcR:::cbind.na(sort(agnPoor),
                                       sort(diaPoor)))
names(bad_Clus)=c("agn","dia")
bad_Clus
##                  agn             dia
## 1            Algeria          Guyana
## 2         Bangladesh           Italy
## 3              Chile North Macedonia
## 4            Croatia    Sierra Leone
## 5             Cyprus       Singapore
## 6  Equatorial Guinea            <NA>
## 7              Gabon            <NA>
## 8            Hungary            <NA>
## 9             Jordan            <NA>
## 10        Kyrgyzstan            <NA>
## 11           Lesotho            <NA>
## 12        Madagascar            <NA>
## 13           Morocco            <NA>
## 14  Papua New Guinea            <NA>
## 15         Sri Lanka            <NA>
## 16            Turkey            <NA>

How to compare clusterings?

  1. Color the maps:

    • For Hierarchical AGNES:
base= ggplot(data=fromPy,
             aes(x=dim1, y=dim2,
                 label=Country)) 
agnPlot=base + labs(title = "AGNES") + geom_point(size=2,
                                              aes(color=agn),
                                              show.legend = T) 
    • For Hierarchical DIANA:
diaPlot=base + labs(title = "DIANA") + geom_point(size=2,
                                              aes(color=dia),
                                              show.legend = T) 
  2. Compare visually:
library(ggpubr)

ggarrange(agnPlot, diaPlot,ncol = 2,common.legend = T)

  3. Annotating outliers:

    • Prepare labels:
# If the country is among the poorly clustered, keep its name; otherwise, use an empty label
LABELdia=ifelse(fromPy$Country%in%diaPoor,fromPy$Country,"")
LABELagn=ifelse(fromPy$Country%in%agnPoor,fromPy$Country,"")
    • Prepare plot with the outlying labels:
library(ggrepel)
diaPlot=diaPlot + 
        geom_text_repel(aes(label=LABELdia),
                        max.overlaps=50,
                        min.segment.length =unit(0,'lines'))
agnPlot= agnPlot + 
         geom_text_repel(aes(label=LABELagn),
                         max.overlaps = 50,
                         min.segment.length = unit(0, 'lines'))
    • Plot and compare:
ggarrange(agnPlot, 
          diaPlot,
          ncol = 2,
          common.legend = T)

  4. Produce Dendrograms for the results:
fviz_dend(res.agnes,
          k=NumberOfClusterDesired, 
          cex = 0.45, 
          horiz = T,
          main = "AGNES approach")

fviz_dend(res.diana,
          k=NumberOfClusterDesired, 
          cex = 0.45, 
          horiz = T,
          main = "DIANA approach")

Let’s compare these clusters with the levels proposed by The Economist:

table(fromPy$Regimetype,fromPy$agn)
##                   
##                     4  3  2  1
##   Authoritarian     0  0 38 21
##   Hybrid regime     0 15 18  0
##   Flawed democracy 20 30  3  0
##   Full democracy   20  0  0  0
table(fromPy$Regimetype,fromPy$dia)
##                   
##                     4  3  2  1
##   Authoritarian     0  2 13 44
##   Hybrid regime     0  0 24  9
##   Flawed democracy 14 22 17  0
##   Full democracy   20  0  0  0

This way, you can use the clustering results to validate other classifications done theoretically or by simpler techniques (e.g., averaging).
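A numeric complement to the tables above is an agreement index between the two partitions; a minimal sketch, assuming the mclust package (not used elsewhere in this session) is available:

```r
library(mclust)
# adjusted Rand index: 1 = perfect agreement, ~0 = agreement by chance
adjustedRandIndex(fromPy$Regimetype, fromPy$agn)
adjustedRandIndex(fromPy$Regimetype, fromPy$dia)
```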

FACTOR ANALYSIS

Simply speaking, this technique tries to express in one (or a few) dimension(s) the behavior of several others. FA assumes that the several input variables have ‘something’ in common: there is something latent that the set of input variables represents.
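A minimal simulated sketch of that ‘latent’ idea: three observed variables are built from one hidden factor plus noise, and fa() recovers a single factor on which all three load:

```r
library(psych)
set.seed(123)
latent = rnorm(200)                      # the unobserved factor
obs = data.frame(x1 = 0.8*latent + rnorm(200, sd = 0.4),
                 x2 = 0.7*latent + rnorm(200, sd = 0.5),
                 x3 = 0.9*latent + rnorm(200, sd = 0.3))
fa(obs, nfactors = 1)$loadings           # all three variables load highly
```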

Let’s follow these steps:

1. Subset our original data frame:

Our dataForFA has the same data as the one we used for clustering.

dataForFA=dataToCluster

We know that we have two sets of variables, one related to democracy and the other to fragility, all of them computed at the country level.

2. Compute the correlations:

In factor analysis we need a measure of similarity among variables (not cases). Let me compute the heterogeneous correlation matrix (Pearson correlations between numeric variables, polyserial correlations between numeric and ordinal variables, and polychoric correlations between ordinal variables).

library(polycor)
corMatrix=polycor::hetcor(dataForFA)$correlations

We can visualize this matrix:

library(ggcorrplot)

ggcorrplot(corMatrix,
           type = "lower") + 
          theme(axis.text.x  = element_text(size = 5),
                axis.text.y  = element_text(size = 5))

You should notice that the variables that belong to the same concept are correlated with one another, while variables from different concepts should have a low correlation.

3. Check Conditions

3.1 The amount of data should be enough for the correlation process:

library(psych)
psych::KMO(corMatrix) 
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix)
## Overall MSA =  0.94
## MSA for each item = 
## Electoralprocessandpluralism      Functioningofgovernment 
##                         0.92                         0.96 
##       Politicalparticipation             Politicalculture 
##                         0.94                         0.96 
##               Civilliberties          C1SecurityApparatus 
##                         0.92                         0.95 
##        C2FactionalizedElites             C3GroupGrievance 
##                         0.94                         0.93 
##                    E1Economy         E2EconomicInequality 
##                         0.97                         0.94 
##   E3HumanFlightandBrainDrain            P1StateLegitimacy 
##                         0.96                         0.94 
##             P2PublicServices                P3HumanRights 
##                         0.91                         0.94 
##       S1DemographicPressures            S2RefugeesandIDPs 
##                         0.94                         0.94 
##       X1ExternalIntervention 
##                         0.94

3.2 The correlation matrix should not be an identity matrix:

# is identity matrix?
cortest.bartlett(corMatrix,n=nrow(dataForFA))$p.value>0.05
## [1] FALSE

3.3 The correlation matrix should not be singular

library(matrixcalc)

is.singular.matrix(corMatrix)
## [1] TRUE

If some conditions fail you should not expect a reliable result; however, you may continue in order to identify the sources of the flaws.

4. Get suggestions for the number of factors

fa.parallel(dataForFA, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

5. Compute the Factors

library(GPArotation)
resfa <- fa(dataForFA,
            nfactors = 2,
            cor = 'mixed',
            rotate = "varimax",
            fm="minres")

6. Explore results

### see results
print(resfa$loadings)
## 
## Loadings:
##                              MR1    MR2   
## Electoralprocessandpluralism -0.215 -0.885
## Functioningofgovernment      -0.516 -0.744
## Politicalparticipation       -0.294 -0.743
## Politicalculture             -0.427 -0.570
## Civilliberties               -0.284 -0.930
## C1SecurityApparatus           0.735  0.465
## C2FactionalizedElites         0.584  0.645
## C3GroupGrievance              0.407  0.487
## E1Economy                     0.799  0.304
## E2EconomicInequality          0.795  0.380
## E3HumanFlightandBrainDrain    0.780  0.246
## P1StateLegitimacy             0.479  0.820
## P2PublicServices              0.885  0.330
## P3HumanRights                 0.466  0.800
## S1DemographicPressures        0.841  0.305
## S2RefugeesandIDPs             0.685  0.384
## X1ExternalIntervention        0.714  0.424
## 
##                  MR1   MR2
## SS loadings    6.503 6.090
## Proportion Var 0.383 0.358
## Cumulative Var 0.383 0.741

You can see better using a cutoff:

print(resfa$loadings,cutoff = 0.5)
## 
## Loadings:
##                              MR1    MR2   
## Electoralprocessandpluralism        -0.885
## Functioningofgovernment      -0.516 -0.744
## Politicalparticipation              -0.743
## Politicalculture                    -0.570
## Civilliberties                      -0.930
## C1SecurityApparatus           0.735       
## C2FactionalizedElites         0.584  0.645
## C3GroupGrievance                          
## E1Economy                     0.799       
## E2EconomicInequality          0.795       
## E3HumanFlightandBrainDrain    0.780       
## P1StateLegitimacy                    0.820
## P2PublicServices              0.885       
## P3HumanRights                        0.800
## S1DemographicPressures        0.841       
## S2RefugeesandIDPs             0.685       
## X1ExternalIntervention        0.714       
## 
##                  MR1   MR2
## SS loadings    6.503 6.090
## Proportion Var 0.383 0.358
## Cumulative Var 0.383 0.741

The previous results serve to indicate whether the variables group in a meaningful way. In our example, you want to know if the indicators in each set are grouped together. These results can also be visualized:

fa.diagram(resfa,main = "EFA results")

7. Analyse the results

7.1 How are indicators contributing to the process?

# The higher the better
sort(resfa$communality)
##             C3GroupGrievance             Politicalculture 
##                    0.4034923                    0.5068953 
##            S2RefugeesandIDPs       Politicalparticipation 
##                    0.6161475                    0.6393868 
##   E3HumanFlightandBrainDrain       X1ExternalIntervention 
##                    0.6690640                    0.6900314 
##                    E1Economy          C1SecurityApparatus 
##                    0.7306222                    0.7567213 
##        C2FactionalizedElites         E2EconomicInequality 
##                    0.7571858                    0.7754234 
##       S1DemographicPressures      Functioningofgovernment 
##                    0.8000285                    0.8202032 
## Electoralprocessandpluralism                P3HumanRights 
##                    0.8298193                    0.8575290 
##             P2PublicServices            P1StateLegitimacy 
##                    0.8920841                    0.9021443 
##               Civilliberties 
##                    0.9462528

7.2 What indicators are loading in more than one factor?

# closer to 2, 3, etc?
# closer to zero?
sort(resfa$complexity)
## Electoralprocessandpluralism               Civilliberties 
##                     1.118078                     1.184325 
##   E3HumanFlightandBrainDrain       S1DemographicPressures 
##                     1.197527                     1.258663 
##             P2PublicServices                    E1Economy 
##                     1.273266                     1.283032 
##       Politicalparticipation         E2EconomicInequality 
##                     1.306003                     1.433648 
##            S2RefugeesandIDPs                P3HumanRights 
##                     1.572465                     1.607907 
##            P1StateLegitimacy       X1ExternalIntervention 
##                     1.611564                     1.628395 
##          C1SecurityApparatus      Functioningofgovernment 
##                     1.691080                     1.782136 
##             Politicalculture             C3GroupGrievance 
##                     1.853424                     1.939266 
##        C2FactionalizedElites 
##                     1.980759

8. Make a decision

  • Can we use these results? Maybe not.
  • Can we improve these results? You can try to eliminate variables, interpreting the results.

9. Improve the Factor Analysis

9.1 Let’s remove some variables. These belong to ‘fragility’ but may be close to ‘democracy’:

ps=c("P1StateLegitimacy","P2PublicServices","P3HumanRights")
notPs=setdiff(names(dataForFA),ps)
dataForFA2=dataForFA[,notPs]

9.2 Recompute correlations

# recompute the heterogeneous correlation matrix:
library(polycor)
corMatrix2=polycor::hetcor(dataForFA2)$correlations

9.3 Recheck conditions:

  1. KMO
library(psych)
psych::KMO(corMatrix2) 
## Kaiser-Meyer-Olkin factor adequacy
## Call: psych::KMO(r = corMatrix2)
## Overall MSA =  0.93
## MSA for each item = 
## Electoralprocessandpluralism      Functioningofgovernment 
##                         0.88                         0.96 
##       Politicalparticipation             Politicalculture 
##                         0.93                         0.94 
##               Civilliberties          C1SecurityApparatus 
##                         0.89                         0.97 
##        C2FactionalizedElites             C3GroupGrievance 
##                         0.96                         0.90 
##                    E1Economy         E2EconomicInequality 
##                         0.96                         0.90 
##   E3HumanFlightandBrainDrain       S1DemographicPressures 
##                         0.95                         0.89 
##            S2RefugeesandIDPs       X1ExternalIntervention 
##                         0.93                         0.93
  2. Bartlett
cortest.bartlett(corMatrix2,n=nrow(dataForFA2))$p.value>0.05
## [1] FALSE
  3. Singularity
library(matrixcalc)

is.singular.matrix(corMatrix2)
## [1] FALSE

9.4. Get suggestions

fa.parallel(dataForFA2, fa = 'fa',correct = T,plot = F)
## Parallel analysis suggests that the number of factors =  2  and the number of components =  NA

9.5 Compute factors

library(GPArotation)
resfa <- fa(dataForFA2,
            nfactors = 2,
            cor = 'mixed',
            rotate = "varimax",
            fm="minres")

9.6 Explore results

Here, numerically:

print(resfa$loadings,cutoff = 0.5)
## 
## Loadings:
##                              MR1    MR2   
## Electoralprocessandpluralism         0.905
## Functioningofgovernment      -0.549  0.726
## Politicalparticipation               0.749
## Politicalculture                     0.530
## Civilliberties                       0.932
## C1SecurityApparatus           0.767       
## C2FactionalizedElites         0.645 -0.578
## C3GroupGrievance                          
## E1Economy                     0.812       
## E2EconomicInequality          0.767       
## E3HumanFlightandBrainDrain    0.791       
## S1DemographicPressures        0.817       
## S2RefugeesandIDPs             0.710       
## X1ExternalIntervention        0.755       
## 
##                  MR1   MR2
## SS loadings    5.601 4.369
## Proportion Var 0.400 0.312
## Cumulative Var 0.400 0.712

Visually:

fa.diagram(resfa,main = "EFA results (2)")

9.7 Analyse new results

  • Communalities:
sort(resfa$communality)
##             C3GroupGrievance             Politicalculture 
##                    0.4008682                    0.5022138 
##            S2RefugeesandIDPs       Politicalparticipation 
##                    0.6331073                    0.6544139 
##   E3HumanFlightandBrainDrain       X1ExternalIntervention 
##                    0.6723772                    0.7176851 
##         E2EconomicInequality                    E1Economy 
##                    0.7265745                    0.7338534 
##        C2FactionalizedElites       S1DemographicPressures 
##                    0.7506536                    0.7516278 
##          C1SecurityApparatus      Functioningofgovernment 
##                    0.7566777                    0.8290041 
## Electoralprocessandpluralism               Civilliberties 
##                    0.8723964                    0.9676060
  • Complexity:
sort(resfa$complexity)
## Electoralprocessandpluralism   E3HumanFlightandBrainDrain 
##                     1.129833                     1.147993 
##                    E1Economy               Civilliberties 
##                     1.224932                     1.226998 
##       S1DemographicPressures       Politicalparticipation 
##                     1.247477                     1.323574 
##         E2EconomicInequality            S2RefugeesandIDPs 
##                     1.445441                     1.480002 
##       X1ExternalIntervention          C1SecurityApparatus 
##                     1.484753                     1.529322 
##      Functioningofgovernment             Politicalculture 
##                     1.862299                     1.972808 
##        C2FactionalizedElites             C3GroupGrievance 
##                     1.976024                     1.994779

It is not that we have the perfect solution, but you eventually need to stop.

10. Compute the scores

Each factor is represented by a score:

summary(resfa$scores)
##       MR1               MR2         
##  Min.   :-2.3475   Min.   :-2.2907  
##  1st Qu.:-0.7799   1st Qu.:-0.8327  
##  Median : 0.1728   Median : 0.2267  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.7639   3rd Qu.: 0.8467  
##  Max.   : 1.6656   Max.   : 1.6075

As you see, the scores are not in the range from zero to ten; let’s rescale them:

library(BBmisc)
efa_scores=normalize(resfa$scores, 
                       method = "range", 
                       margin=2, # by column
                       range = c(0, 10))
# save the scores in the data
fromPy$Fragile_efa=efa_scores[,1]
fromPy$Demo_efa=efa_scores[,2]

You do not normally have a means to validate the scores, but our example data has pre-computed scores. Let’s use those to see if our scores make sense.

  • Democracy:
library("ggpubr")
ggscatter(data=fromPy, x = "Overallscore", y = "Demo_efa", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "DemoIndex (original)", ylab = "DemoIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'

  • Fragility
ggscatter(data=fromPy, x = "Total", y = "Fragile_efa", 
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "FragileIndex (original)", ylab = "FragileIndex (efa)")
## `geom_smooth()` using formula 'y ~ x'